Analysing Dengue Cases in Singapore

Problem Statement

In our Capstone Project, we aim to explore the relationship between meteorological conditions (air temperature and rainfall) in Singapore and clinical conditions commonly seen in the local community – dengue fever (DF), hand, foot and mouth disease (HFMD), upper respiratory tract infection (URTI) and diarrhea. Specifically, are higher temperatures and rainfall associated with a rise in these clinical conditions in the community? Furthermore, we are interested in investigating whether the type of housing is associated with the number of dengue fever cases.

Dengue fever is a vector-borne infectious disease that is endemic in the tropical world. Singapore is one of several countries with a high disease burden of dengue. In 2020, Singapore saw 1,158 dengue cases in a single week of June - the highest number of weekly dengue cases recorded since 2014. This sudden increase is worrisome, so our analysis will place greater emphasis on dengue fever.

For our data science project, we activated the following packages, using the Tidyverse approach.

# Rule for activating packages in development mode:
# - Specify everything fully
# - Exceptions:
#   - ggfortify - to overload `ggplot2::autoplot()`
#   - ggmap - has to be activated for `ggmap::register_google()` to work
#   - ggplot2 - some function arguments require `ggplot2::`, it's crazy
#   - magrittr - `%>%`
#
# For deployment mode, activate every package that is referenced more than once
# - Add package name here, then use find-and-replace to remove "<package>::"
# - Allowed to activate tidyverse
#   - Remember what the 8 packages are?

pacman::p_load(
  ggfortify,
  # ggmap,
  ggplot2,
  magrittr
  # tidyverse
)

Import

Then, we imported our dataset.

Most time-consuming, finding data, is.

We work towards 2 datasets: data_time for temporal analysis, and data_space for spatial analysis. When they are ready, the import-tidy-transform process will be complete - at least for the first round; subsequent rounds of transformation might still be required, depending on the needs of the modeling and visualization processes.

All data are available online and were gathered into 7 distinct raw datasets (click here for a summary page):

The full glorious details of their provenance are given below.

Helper functions

The following functions must be activated no matter which method of data import is chosen.

import_moh_weekly <- function(url_or_path = NULL) {
  #' Weekly Infectious Diseases Bulletin
  #' 
  #' @description
  #' \href{https://www.moh.gov.sg/resources-statistics/infectious-disease-
  #' statistics/2020/weekly-infectious-diseases-bulletin}{MOH Weekly Infectious 
  #' Disease Bulletin} from the Ministry of Health (MOH). 
  #' 
  #' \href{https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/
  #' weekly-infectious-disease-bulletin-year-2020
  #' d1092fcb484447bc96ef1722b16b0c08.xlsx}{Latest data as of 31 July 2020 
  #' (2012-W01 to 2020-W30)}.
  #' 
  #' @param url_or_path The URL or file path of the .xlsx file.
  #' @return Weekly infectious diseases bulletin (2012-W01 to 2020-W30).
  
  # If no URL or path is specified, try to get the file directly from MOH.
  if (is.null(url_or_path)) {
    url_or_path = paste0(
      "https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/",
      "weekly-infectious-disease-bulletin-year-2020",
      "d1092fcb484447bc96ef1722b16b0c08.xlsx"
    )
  }
  
  # Check if the given path is a URL by trying to download to a temp file. If 
  #   successful, return the temp file. If not, return the original path.
  xlsx_file = tryCatch({
    temp = tempfile(fileext = ".xlsx")
    download.file(url_or_path, destfile = temp, mode = "wb")
    temp
  }, error = function(e) {
    url_or_path
  })
  
  # Columns will be renamed to follow the 2020 naming convention
  colnames_2020 = c(
    "Campylobacter enterosis" = "Campylobacter enteritis",
    "Campylobacterenterosis" = "Campylobacter enteritis",
    "Campylobacteriosis" = "Campylobacter enteritis",
    "Chikungunya Fever" = "Chikungunya",
    "Dengue Haemorrhagic Fever" = "DHF",
    "Dengue Fever" = "Dengue",
    "Hand, Foot and Mouth Disease" = "HFMD",
    "Hand, Foot Mouth Disease" = "HFMD",
    "Nipah virus infection" = "Nipah",
    "Viral Hepatitis A" = "Acute Viral Hepatitis A",
    "Viral Hepatitis E" = "Acute Viral Hepatitis E",
    "Zika Virus Infection" = "Zika",
    "Zika virus infection" = "Zika"
  )
  
  xlsx_file %>%
    readxl::excel_sheets() %>% 
    lapply(function(sheetname) {
      df = readxl::read_xlsx(xlsx_file, sheetname, skip = 1)
      
      # Date formats are different for 2020 (dmy instead of mdy)
      if (sheetname == "2020") {
        df$Start = lubridate::dmy(df$Start)
        df$End = lubridate::dmy(df$End)
      }
      
      # Find the columns that need to be renamed, then rename them
      mapper = na.omit(colnames_2020[names(df)])
      dplyr::rename_with(df, ~mapper, names(mapper))
    }) %>% 
    dplyr::bind_rows() %>% 
    dplyr::rename(Epiweek = `Epidemiology Wk`) %>% 
    dplyr::mutate(Epiyear = lubridate::epiyear(Start)) %>% 
    dplyr::select(Epiyear, everything()) %>% 
    dplyr::arrange(Start)
}
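The lookup-and-rename step near the end can be illustrated in isolation. A minimal sketch on a toy data frame, with a hypothetical two-entry mapper standing in for `colnames_2020`:

```r
library(dplyr)

# Hypothetical two-entry lookup in the same shape as `colnames_2020`:
#   old column name -> standardized 2020 name
mapper_all <- c(
  "Dengue Fever" = "Dengue",
  "Hand, Foot and Mouth Disease" = "HFMD"
)

df <- data.frame(`Dengue Fever` = 1:3, Other = 4:6, check.names = FALSE)

# Indexing the lookup by names(df) yields NA for columns not in the lookup;
# na.omit() keeps only the columns that actually need renaming.
mapper <- na.omit(mapper_all[names(df)])

# Rename just those columns, leaving the rest untouched
df2 <- rename_with(df, ~ mapper, .cols = all_of(names(mapper)))

names(df2)
```

Only the matched column is renamed; `Other` passes through unchanged, so the same mapper works across sheets with different column subsets.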

read_kmls <- function(url_or_path) {
  #' Read Single or Multiple KML Files
  #' 
  #' @description
  #' There are (at least) 2 approaches to handling .kml data:
  #' \enumerate{
  #'   \item sp - rgdal::readOGR()
  #'   \item sf - sf::st_read()
  #' }
  #'
  #' The sp approach was abandoned as their objects were rather complex and 
  #'   did not facilitate method chaining.
  #'
  #' The sf approach produced objects that look like data.frames, which had 
  #'   better support for method chaining, but also some peculiarities:
  #' \itemize{
  #'   \item Dimension: set to XY using sf::st_zm()
  #'   \item Geographic CRS: use sf::`st_crs<-`("WGS84") World Geodetic 
  #'         Survey 1984
  #' }
  #' 
  #' @param url_or_path The URL(s) or file path(s) of the .kml file(s).
  #' @return A single combined sf object.
  
  # Check if the given paths are URLs by trying to download to temp files. If 
  #   successful, return the temp files. If not, return the original paths.
  kml_files = tryCatch({
    temp = tempfile(fileext = rep(".kml", length(url_or_path)))
    Map(function(x, y) download.file(x, y, mode = "wb"), url_or_path, temp)
    temp
  }, error = function(e) {
    url_or_path
  })
  
  kml_files %>% 
    lapply(sf::st_read, quiet = T) %>% 
    dplyr::bind_rows() %>% 
    tibble::as_tibble() %>% 
    sf::st_as_sf() %>% 
    sf::st_zm()
}

Import data from scratch (where available)

Webscraping functions

The following code chunk has been disabled, but is provided here to show the details behind the webscraping processes.

import_mss_daily <- function(years, stations = NULL) {
  #' Historical Daily Weather Records
  #' 
  #' @description
  #' \href{http://www.weather.gov.sg/climate-historical-daily/}{Daily weather 
  #' records} from the Meteorological Service Singapore (MSS). 
  #' Available data ranges from January 1980 to June 2020. Only 19 of the 63 
  #' climate stations are recognized by this function, because these contain 
  #' more than just rainfall data. \href{http://www.weather.gov.sg/wp-content/
  #' uploads/2016/12/Station_Records.pdf}{List of stations, weather parameters 
  #' and periods of records}.
  #' 
  #' MSS is nice enough to have their data in .csv files, with a systematic 
  #' naming scheme. Data compilation becomes as simple as generating the 
  #' appropriate list of URLs.
  #' 
  #' @param years A vector of the years of interest.
  #' @param stations A vector of the climate station names.
  #' @return Combined daily weather records.
  #' @examples
  #' import_mss_daily(2012:2020, "Changi")
  #' import_mss_daily(2012:2020, c("Changi", "Clementi", "Khatib", "Newton"))
  
  stations_lookup = c(
    "Admiralty" = "104_",
    "Ang Mo Kio" = "109_",
    "Changi" = "24_",
    "Choa Chu Kang (South)" = "121_",
    "Clementi" = "50_",
    "East Coast Parkway" = "107_",
    "Jurong (West)" = "44_",
    "Jurong Island" = "117_",
    "Khatib" = "122_",
    "Marina Barrage" = "108_",
    "Newton" = "111_",
    "Pasir Panjang" = "116_",
    "Pulau Ubin" = "106_",
    "Seletar" = "25_",
    "Sembawang" = "80_",
    "Sentosa Island" = "60_",
    "Tai Seng" = "43_",
    "Tengah" = "23_",
    "Tuas South" = "115_"
  )
  
  # Check that all provided station names are in the list, if not, exit and 
  #   print out the list (of names) for the user.
  mask = !(stations %in% names(stations_lookup))
  if (any(mask)) {
    stop("The following station names are not recognized:\n",
         paste(stations[mask], collapse = "\n"),
         "\n\nPlease select from the following:\n",
         paste(names(stations_lookup), collapse = "\n"))
  }
  
  # If no station names specified, take the full list
  if (is.null(stations)) {
    station_nums = stations_lookup
  } else {
    station_nums = stations_lookup[stations]
  }
  
  # The URL to each .csv file has the following format:
  # - "<url_base><station number>_<YYYY><MM>.csv"
  # - Notice the station numbers given above contain the "_" suffix
  url_base = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_S"
  
  # To enumerate all the URLs, take the Cartesian product of:
  # - station numbers
  # - years
  # - months (we'll always attempt to find all 12 months)
  #
  # Flatten the result into a vector, then prefix and suffix with `url_base` 
  #   and ".csv", respectively.
  
  # Base R
  urls = station_nums %>% 
    outer(years, FUN = paste0) %>% 
    outer(sprintf("%02d", 1:12), FUN = paste0) %>% 
    sort() %>%  # Flatten and arrange in alphabetical order
    paste0(url_base, ., ".csv")
  
  # Equivalent tidyverse approach (for reference)
  # urls = station_nums %>% 
  #   tidyr::crossing(years, sprintf("%02d", 1:12)) %>% 
  #   tidyr::unite("station_year_month", sep = "") %>% 
  #   dplyr::pull() %>% 
  #   paste0(url_base, ., ".csv")
  
  # We have a list of URLs, but some of them may not exist (some stations do 
  #   not have data for some months). Use tryCatch() to return a table for the 
  #   URLs that exist, and a NA for those that don't.
  #
  # Problems with multibyte characters and countermeasures:
  # - Some colnames contained the "degree sign" symbol which somehow made the 
  #   table contents inaccessible. Tables after April 2020 had slightly 
  #   different colnames.
  #   - Manually set the colnames for each table.
  # - Some tables contained the "em dash" symbol to indicate missing values.
  #   - They will be coerced to NA when the columns are type cast to numeric.
  #   - There will be warning messages.
  dfs = urls %>% 
    lapply(function(url) {
      tryCatch({
        df = readr::read_csv(url,
                             skip = 1,
                             col_names = c(
                               "Station",
                               "Year",
                               "Month",
                               "Day",
                               "Rainfall",
                               "Highest_30_min_rainfall",
                               "Highest_60_min_rainfall",
                               "Highest_120_min_rainfall",
                               "Mean_temp",
                               "Max_temp",
                               "Min_temp",
                               "Mean_wind",
                               "Max_wind"
                             ),
                             col_types = readr::cols_only(
                               "Station" = "c",
                               "Year" = "n",
                               "Month" = "n",
                               "Day" = "n",
                               "Rainfall" = "n",
                               "Mean_temp" = "n",
                               "Max_temp" = "n",
                               "Min_temp" = "n"
                             ))
        
        # Announce progress (UX is important! We can tolerate lower efficiency)
        message(paste(df[1, 1:3], collapse = "_"))
        
        df
      }, error = function(e) {
        NA
      })
    })
  
  dfs[!is.na(dfs)] %>% 
    dplyr::bind_rows() %>% 
    # Calculate daily temperature range
    dplyr::mutate(Temp_range = Max_temp - Min_temp,
                  .keep = "unused") %>% 
    # Calculate epidemiological years and weeks
    dplyr::mutate(Date = lubridate::ymd(paste(Year, Month, Day, sep = "-")),
                  Epiyear = lubridate::epiyear(Date),
                  Epiweek = lubridate::epiweek(Date),
                  .keep = "unused",
                  .after = Station) %>% 
    dplyr::arrange(Station, Date)
}

import_hcidirectory <- function() {
  #' Healthcare Institutions Directory
  #' 
  #' @description
  #' The \href{http://hcidirectory.sg/hcidirectory/}{Healthcare Institutions 
  #' (HCI) Directory}, an initiative by the Ministry of Health (MOH), is a 
  #' platform for all HCIs licensed under the Private Hospitals and Medical 
  #' Clinics (PHMC) Act to provide information about their services and 
  #' operations to the public.
  #' 
  #' This function is custom-made to consolidate the names and addresses of 
  #' HCIs which are medical clinics that offer general medical services.
  #' 
  #' The HCI Directory is a dynamic web page, so using RSelenium might be 
  #' required.
  #' 
  #' @return The names and addresses of selected HCIs.
  
  # Run a Selenium Server using `RSelenium::rsDriver()`. The parameters e.g. 
  #   `browser`, `chromever` (or `geckover` if using Firefox, or other drivers 
  #   if using other browsers) have to be properly set. Trial-and-error until a 
  #   configuration works. Set `check = T` the very first time it's run on a 
  #   system, then set `check = F` after that to speed things up.
  rD = RSelenium::rsDriver(browser = "chrome",
                           chromever = "83.0.4103.39",
                           check = F)
  
  # Connect to server with a remoteDriver instance.
  remDr = rD$client
  
  # Set timeout on waiting for elements
  remDr$setTimeout(type = "implicit", milliseconds = 10000)
  
  # Navigate to the given URL
  remDr$navigate("http://hcidirectory.sg/hcidirectory/")
  
  # Click 4 things:
  # 1. "MORE SEARCH OPTIONS"
  # 2. "Medical Clinics Only"
  # 3. "General Medical"
  # 4. "Search"
  c(
    "options" = "#moreSearchOptions",
    "medclins" = "#criteria > table > tbody > tr:nth-child(2) > td > label",
    "genmed" = "#isGenMed",
    "search" = "#search_btn_left"
  ) %>% 
    lapply(remDr$findElement, using = "css") %>% 
    purrr::walk(function(elem) elem$clickElement())
  
  # Find the number of pages
  results = remDr$findElement("#results", using = "css")
  npages = results$getElementAttribute("innerHTML")[[1]] %>% 
    xml2::read_html() %>% 
    rvest::html_node("#totalPage") %>% 
    rvest::html_attr("value") %>% 
    as.numeric()
  
  # Create an empty tibble to append results
  df = tibble::tibble(
    id = character(),
    name = character(),
    add = character()
  )
  
  i = 1
  while (T) {
    results = remDr$findElement("#results", using = "css")
    html = results$getElementAttribute("innerHTML")[[1]] %>% 
      xml2::read_html()
    
    # Determine the index numbers of the (up to 10) results on the page
    idx = html %>% 
      # Find the element that says "SHOWING 1 - 10 OF 1,761 RESULTS"
      rvest::html_nodes(".col1") %>% 
      .[1] %>% 
      rvest::html_text() %>% 
      # Commas have to be eliminated for numbers > 999
      gsub(",", "", .) %>% 
      # Find the smallest and largest numbers and apply the colon operator
      sub(".*Showing\\s+(.*)\\s+of.*", "\\1", .) %>% 
      strsplit(split = " - ") %>% 
      unlist() %>% 
      as.numeric() %>% 
      { .[1]:.[2] }
    
    # Only append results if IDs are not in the table
    if (!any(idx %in% df$id)) {
      df = df %>% 
        dplyr::bind_rows(
          html %>%
            # Find both the name and address nodes
            rvest::html_nodes(".name,.add") %>% 
            rvest::html_text() %>% 
            # Tidy whitespace
            gsub("\\s+", " ", .) %>% 
            trimws() %>% 
            # Concatenate IDs, odd rows (names), and even rows (addresses)
            { cbind(idx, .[c(TRUE,FALSE)], .[!c(TRUE,FALSE)]) } %>% 
            tibble::as_tibble() %>% 
            setNames(c("id", "name", "add"))
        )
      
      # Announce progress and increment page counter
      message(i, " of ", npages, " done (", round(i / npages * 100, 2), "%)")
      i = i + 1
    }
    
    # Natural exit point
    if (i > npages) break
    
    # Navigate to the next page (if available, else stop)
    the_end = tryCatch({
      nextpage = remDr$findElement("#PageControl > div.r_arrow", using = "css")
      nextpage$clickElement()
      F
    }, error = function(e) {
      print(paste("There are no more pages after", i))
      T
    })
    
    # Unnatural exit point
    if (the_end) break
  }
  
  # Clean up RSelenium
  remDr$close()
  rD[["server"]]$stop()
  rm(rD, remDr)
  gc()
  
  # Kill Java instance(s) inside RStudio
  # docs.microsoft.com/en-us/windows-server/administration/windows-commands/taskkill
  system("taskkill /im java.exe /f", intern = F, ignore.stdout = F)
  
  # Clean up:
  # - Franchises may have the same name with different addresses
  # - Different practices may have the same zipcodes and even buildings
  # - We will consider each full address unique, and a single practice
  
  # Clean up duplicate addresses
  df %>% 
    .[!duplicated(tolower(.$add), fromLast = T),]
}
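The index-parsing pipeline buried in the middle of the loop is worth seeing in isolation. A minimal sketch on a hard-coded string (the real input is the text of the page's `.col1` node):

```r
library(magrittr)

txt <- "Showing 11 - 20 of 1,761 results"

idx <- txt %>%
  gsub(",", "", .) %>%                            # drop thousands separators
  sub(".*Showing\\s+(.*)\\s+of.*", "\\1", .) %>%  # keep just "11 - 20"
  strsplit(split = " - ") %>%
  unlist() %>%
  as.numeric() %>%
  { .[1]:.[2] }                                   # expand to 11, 12, ..., 20
```

The braces around the final step are magrittr syntax for an expression block, which lets the piped value be referenced twice as `.`.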

zipcodes_to_geocodes <- function(zipcodes) {
  #' Get Geo-location from Google Maps
  #' 
  #' @description
  #' Attempt to obtain the longitudes, latitudes, and addresses of the given 
  #' zipcodes using ggmap::geocode().
  #' 
  #' @param zipcodes A vector of zipcodes.
  #' @return Geo-location data of the associated zipcodes.
  
  # Prompt user to input API key
  ggmap::register_google(key = readline("Please enter Google API key: "))
  
  # Create an (almost) empty tibble to append results
  res = zipcodes %>% 
    # Remove duplicates to minimize number of requests
    .[!duplicated(.)] %>% 
    tibble::as_tibble() %>% 
    dplyr::rename(zip = value) %>% 
    dplyr::mutate(lon = NA_real_,
                  lat = NA_real_,
                  address = NA_character_)
  
  for (i in 1:nrow(res)) {
    result = tryCatch({
      ggmap::geocode(res$zip[i], output = "latlona", source = "google")
    }, warning = function(w) {
      w$message
    }, error = function(e) {
      NA
    })
    
    # If the registered key is invalid, there's no point continuing
    if (grepl("The provided API key is invalid", result[1], fixed = T)) {
      stop("A valid Google API key is required.")
    }
    
    # A useful result will have something, and will have names
    #   (use all() so a 1-row data.frame result doesn't break the `&&`)
    if (!all(is.na(result)) && !is.null(names(result))) {
      res$lon[i] = result$lon
      res$lat[i] = result$lat
      res$address[i] = result$address
    }
    
    # Announce progress
    message(i, " of ", nrow(res), " (",round(i / nrow(res) * 100, 2), "%)")
  }
  
  res
}

Import data de novo

The following code chunk has been disabled, but is provided here to show how the data may be obtained from primary sources (with one exception). Using this code chunk would require activating the previous code chunk, as well as a valid Google Maps Platform API key.

grand_import_de_novo <- function() {
  # This might take a while, and requires a Google Maps Platform API key
  
  list(
    "moh_bulletin" = import_moh_weekly(),
    
    "mss_19stations" = import_mss_daily(years = 2012:2020),
    
    "hci_clinics" = import_hcidirectory() %>% 
      dplyr::mutate(zip = sub(".*((?i)singapore\\s+\\d+).*", "\\1", add)) %>% 
      # Requires Google Maps Platform API key
      dplyr::left_join(zipcodes_to_geocodes(.$zip), by = "zip") %>% 
      dplyr::select(-id, -zip, -address) %>% 
      tidyr::drop_na(),
    
    "planning_areas" = read_kmls(paste0(
      "https://geo.data.gov.sg/plan-bdy-dwelling-type-2017/2017/09/27/kml/",
      "plan-bdy-dwelling-type-2017.kml"
    )),
    
    "dengue_polys" = read_kmls(
      c("central", "northeast", "southeast", "southwest") %>% 
        paste0("https://geo.data.gov.sg/denguecase-", .,
               "-area/2020/07/17/kml/denguecase-", ., "-area.kml")
    ),
    
    "aedes_polys" = read_kmls(c(
      c("central", "northeast", "northwest") %>% 
        paste0("https://geo.data.gov.sg/breedinghabitat-", .,
               "-area/2020/07/17/kml/breedinghabitat-", ., "-area.kml"),
      c("southeast", "southwest") %>% 
        paste0("https://geo.data.gov.sg/breedinghabitat-", .,
               "-area/2020/07/23/kml/breedinghabitat-", ., "-area.kml")
    )),
    
    "mss_63station_pos" = readr::read_csv(paste0(
      "https://raw.githubusercontent.com/roscoelai/dasr2020capstone/master/",
        "data/mss/Station_Records.csv"
    ))
  )
}

# raw_data <- grand_import_de_novo()

Import data from saved files

If the data files are available locally, set from_online_repo = F for faster loading. Otherwise, the files will be obtained from an online repository. The .csv files will be read directly, while the other file formats will be downloaded to a temporary folder and read locally.

grand_import_no_webscraping <- function(from_online_repo = TRUE) {
  # Allow user to choose whether to import raw data from an online repository 
  #   or from local files.
  
  if (from_online_repo) {
    fld = paste0("https://raw.githubusercontent.com/roscoelai/",
                 "dasr2020capstone/master/data/")
  } else {
    # Check that the "../data/" folder exists
    assertthat::assert_that(dir.exists("../data/"),
                            msg = 'Unable to locate "../data/" directory.')
    fld = "../data/"
  }
  
  list(
    "moh_bulletin" = import_moh_weekly(paste0(
      fld, "moh/weekly-infectious-disease-bulletin-year-2020.xlsx"
    )),
    
    "mss_19stations" = readr::read_csv(paste0(
      fld, "mss/mss_daily_2012_2020_19stations_20200728.csv"
    )),
    
    "hci_clinics" = readr::read_csv(paste0(
      fld, "hcid/hci_clinics_20200725.csv"
    )),
    
    "planning_areas" = read_kmls(paste0(
      fld, "kmls/plan-bdy-dwelling-type-2017.kml"
    )),
    
    "dengue_polys" = read_kmls(paste0(
      fld, "kmls/denguecase-", c("central",
                                 "northeast",
                                 "southeast",
                                 "southwest"), "-area.kml"
    )),
    
    "aedes_polys" = read_kmls(paste0(
      fld, "kmls/breedinghabitat-", c("central",
                                      "northeast",
                                      "northwest",
                                      "southeast",
                                      "southwest"), "-area.kml"
    )),
    
    "mss_63station_pos" = readr::read_csv(paste0(
      fld, "mss/Station_Records.csv"
    ))
  )
}

raw_data <- grand_import_no_webscraping(from_online_repo = F)

A quick and dirty overview of the raw data

raw_data %T>% {
  print(tibble::tibble(names = names(.),
                       data = .,
                       nrow = sapply(., nrow),
                       ncol = sapply(., ncol)))
}
# A tibble: 7 x 4
  names             data                   nrow  ncol
  <chr>             <named list>          <int> <int>
1 moh_bulletin      <tibble [470 x 46]>     470    46
2 mss_19stations    <tibble [58,489 x 7]> 58489     7
3 hci_clinics       <tibble [1,739 x 4]>   1739     4
4 planning_areas    <tibble [55 x 3]>        55     3
5 dengue_polys      <tibble [1,203 x 3]>   1203     3
6 aedes_polys       <tibble [315 x 3]>      315     3
7 mss_63station_pos <tibble [63 x 9]>        63     9
$moh_bulletin
# A tibble: 470 x 46
   Epiyear Epiweek Start               End                 Cholera Paratyphoid
     <dbl>   <dbl> <dttm>              <dttm>                <dbl>       <dbl>
 1    2012       1 2012-01-01 00:00:00 2012-01-07 00:00:00       0           1
 2    2012       2 2012-01-08 00:00:00 2012-01-14 00:00:00       0           1
 3    2012       3 2012-01-15 00:00:00 2012-01-21 00:00:00       0           1
 4    2012       4 2012-01-22 00:00:00 2012-01-28 00:00:00       0           1
 5    2012       5 2012-01-29 00:00:00 2012-02-04 00:00:00       0           3
 6    2012       6 2012-02-05 00:00:00 2012-02-11 00:00:00       0           2
 7    2012       7 2012-02-12 00:00:00 2012-02-18 00:00:00       0           1
 8    2012       8 2012-02-19 00:00:00 2012-02-25 00:00:00       0           4
 9    2012       9 2012-02-26 00:00:00 2012-03-03 00:00:00       0           4
10    2012      10 2012-03-04 00:00:00 2012-03-10 00:00:00       0           2
# ... with 460 more rows, and 40 more variables: Typhoid <dbl>, `Acute Viral
#   Hepatitis A` <dbl>, `Acute Viral Hepatitis E` <dbl>, Poliomyelitis <dbl>,
#   Plague <dbl>, `Yellow Fever` <dbl>, Dengue <dbl>, DHF <dbl>, Malaria <dbl>,
#   Chikungunya <dbl>, HFMD <dbl>, Diphtheria <dbl>, Measles <dbl>,
#   Mumps <dbl>, Rubella <dbl>, SARS <dbl>, Nipah <dbl>, `Acute Viral hepatitis
#   B` <dbl>, Encephalitis <dbl>, Legionellosis <dbl>, `Campylobacter
#   enteritis` <dbl>, `Acute Viral hepatitis C` <dbl>, Leptospirosis <dbl>,
#   Melioidosis <dbl>, `Meningococcal Infection` <dbl>, Pertussis <dbl>,
#   `Pneumococcal Disease (invasive)` <dbl>, `Haemophilus influenzae type
#   b` <dbl>, `Salmonellosis(non-enteric fevers)` <dbl>, `Avian
#   Influenza` <dbl>, Zika <dbl>, `Ebola Virus Disease` <dbl>, `Japanese
#   Encephalitis` <dbl>, Tetanus <dbl>, Botulism <dbl>, `Murine Typhus` <dbl>,
#   `Acute Upper Respiratory Tract infections` <dbl>, `Acute
#   Conjunctivitis` <dbl>, `Acute Diarrhoea` <dbl>, Chickenpox <dbl>

$mss_19stations
# A tibble: 58,489 x 7
   Station   Date       Epiyear Epiweek Rainfall Mean_temp Temp_range
   <chr>     <date>       <dbl>   <dbl>    <dbl>     <dbl>      <dbl>
 1 Admiralty 2012-01-01    2012       1      0        27.3       7.40
 2 Admiralty 2012-01-02    2012       1      0        27.1       5.8 
 3 Admiralty 2012-01-03    2012       1      0        26.9       4.60
 4 Admiralty 2012-01-04    2012       1      0        26.8       7   
 5 Admiralty 2012-01-05    2012       1      0        26.4       6.5 
 6 Admiralty 2012-01-06    2012       1      0        27.1       6.40
 7 Admiralty 2012-01-07    2012       1      1.4      26.5       6.20
 8 Admiralty 2012-01-08    2012       2      3.4      24.8       2.40
 9 Admiralty 2012-01-09    2012       2      2.6      25.4       3.4 
10 Admiralty 2012-01-10    2012       2      8.6      24.7       2.80
# ... with 58,479 more rows

$hci_clinics
# A tibble: 1,739 x 4
   name                   add                                          lon   lat
   <chr>                  <chr>                                      <dbl> <dbl>
 1 DA CLINIC @ BISHAN     BLK 501 BISHAN STREET 11 #01-374 Singapor~  104.  1.35
 2 SINGHEALTH POLYCLINIC~ 1 TAMPINES STREET 41 TAMPINES POLYCLINIC ~  104.  1.36
 3 1 MEDICAL TECK GHEE    BLK 410 ANG MO KIO AVENUE 10 TECK GHEE SQ~  104.  1.36
 4 115 EASTPOINT CLINIC ~ BLK 115 BEDOK NORTH RD #01-301 Singapore ~  104.  1.33
 5 18 CLINIC              BLK 101 TOWNER ROAD #01-228 Singapore 322~  104.  1.32
 6 1AESTHETICS, MEDICAL ~ 8 EU TONG SEN STREET THE CENTRAL #14-90 S~  104.  1.29
 7 326 AVENUE 3 CLINIC    BLK 326 SERANGOON AVE 3 #01-382 Singapore~  104.  1.35
 8 338 FAMILY CLINIC      BLK 338 ANG MO KIO AVE 1 #01-1615 Singapo~  104.  1.36
 9 57 MEDICAL CLINIC (GE~ BLK 57 GEYLANG BAHRU #01-3505 Singapore 3~  104.  1.32
10 8 MEDICAL AESTHETIC C~ 51 CUPPAGE ROAD #01-03 Singapore 229469     104.  1.30
# ... with 1,729 more rows

$planning_areas
Simple feature collection with 55 features and 2 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
geographic CRS: WGS 84
# A tibble: 55 x 3
   Name   Description                                                   geometry
   <chr>  <chr>                                               <MULTIPOLYGON [°]>
 1 kml_1  "<center><table><tr><th colsp~ (((103.8572 1.396537, 103.8574 1.39629~
 2 kml_2  "<center><table><tr><th colsp~ (((103.9319 1.343094, 103.9355 1.33956~
 3 kml_3  "<center><table><tr><th colsp~ (((103.8492 1.362753, 103.8494 1.36268~
 4 kml_4  "<center><table><tr><th colsp~ (((103.6973 1.307544, 103.6973 1.30755~
 5 kml_5  "<center><table><tr><th colsp~ (((103.7641 1.370011, 103.7644 1.36947~
 6 kml_6  "<center><table><tr><th colsp~ (((103.8172 1.294353, 103.8174 1.29433~
 7 kml_7  "<center><table><tr><th colsp~ (((103.7745 1.390289, 103.775 1.386071~
 8 kml_8  "<center><table><tr><th colsp~ (((103.7977 1.348128, 103.7981 1.34778~
 9 kml_9  "<center><table><tr><th colsp~ (((103.9018 1.309745, 103.9015 1.30954~
10 kml_10 "<center><table><tr><th colsp~ (((103.9081 1.309816, 103.9082 1.30910~
# ... with 45 more rows

$dengue_polys
Simple feature collection with 1203 features and 2 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 103.6848 ymin: 1.269624 xmax: 103.967 ymax: 1.414322
geographic CRS: WGS 84
# A tibble: 1,203 x 3
   Name   Description                                                   geometry
   <chr>  <chr>                                                    <POLYGON [°]>
 1 kml_1  "<center><table><tr><th colsp~ ((103.843 1.322077, 103.843 1.323886, ~
 2 kml_2  "<center><table><tr><th colsp~ ((103.8484 1.327504, 103.8484 1.329312~
 3 kml_3  "<center><table><tr><th colsp~ ((103.8681 1.327503, 103.8681 1.329312~
 4 kml_4  "<center><table><tr><th colsp~ ((103.843 1.3474, 103.843 1.349208, 10~
 5 kml_5  "<center><table><tr><th colsp~ ((103.8448 1.3474, 103.8448 1.349208, ~
 6 kml_6  "<center><table><tr><th colsp~ ((103.8591 1.322077, 103.8591 1.323886~
 7 kml_7  "<center><table><tr><th colsp~ ((103.8699 1.322077, 103.8699 1.323886~
 8 kml_8  "<center><table><tr><th colsp~ ((103.8322 1.320269, 103.8322 1.322077~
 9 kml_9  "<center><table><tr><th colsp~ ((103.834 1.320269, 103.834 1.322077, ~
10 kml_10 "<center><table><tr><th colsp~ ((103.8555 1.320269, 103.8555 1.322077~
# ... with 1,193 more rows

$aedes_polys
Simple feature collection with 315 features and 2 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 103.6848 ymin: 1.27505 xmax: 103.9652 ymax: 1.446879
geographic CRS: WGS 84
# A tibble: 315 x 3
   Name   Description                                                   geometry
   <chr>  <chr>                                                    <POLYGON [°]>
 1 kml_1  "<center><table><tr><th colsp~ ((103.8537 1.370913, 103.8537 1.372722~
 2 kml_2  "<center><table><tr><th colsp~ ((103.8448 1.367296, 103.8448 1.369104~
 3 kml_3  "<center><table><tr><th colsp~ ((103.8358 1.378148, 103.8358 1.379957~
 4 kml_4  "<center><table><tr><th colsp~ ((103.8861 1.378147, 103.8861 1.379956~
 5 kml_5  "<center><table><tr><th colsp~ ((103.8537 1.365487, 103.8537 1.367296~
 6 kml_6  "<center><table><tr><th colsp~ ((103.8466 1.370913, 103.8466 1.372722~
 7 kml_7  "<center><table><tr><th colsp~ ((103.8717 1.37453, 103.8717 1.376339,~
 8 kml_8  "<center><table><tr><th colsp~ ((103.8843 1.376339, 103.8843 1.378147~
 9 kml_9  "<center><table><tr><th colsp~ ((103.8897 1.390808, 103.8897 1.392617~
10 kml_10 "<center><table><tr><th colsp~ ((103.8286 1.314842, 103.8286 1.316651~
# ... with 305 more rows

$mss_63station_pos
# A tibble: 63 x 9
   Station `Lat.(N)` `Long. (E)` `Period of Dail~ `Period of 30,6~
   <chr>       <dbl>       <dbl> <chr>            <chr>           
 1 Paya L~      1.35        104. Jan 1980-current <NA>            
 2 Macrit~      1.34        104. Jan 1980-current Jan 2014-current
 3 Lower ~      1.37        104. Jan 2010-current Jan 2014-current
 4 Tengah       1.39        104. Jan 1980-current <NA>            
 5 Changi       1.37        104. Jan 1981-current Jan 2014-current
 6 Seletar      1.42        104. Jan 1980-current <NA>            
 7 Pasir ~      1.39        104. Jan 1980-current Jan 2014-current
 8 Kampon~      1.27        104. Jan 1980-current Jan 2014-Oct 20~
 9 Jurong~      1.31        104. Jan 1980-current Jan 2014-current
10 Ulu Pa~      1.33        104. Jan 1980-current Jan 2014-current
# ... with 53 more rows, and 4 more variables: `Period of Mean
#   Temperature` <chr>, `Period of Max and Min Temperature` <chr>, `Period of
#   Mean Wind Speed` <chr>, `Period of Max Wind Speed` <chr>

Behold! Our 7 raw datasets, nicely packaged in the list raw_data. No more importing should be necessary from this point.


Tidy & Transform

data_time

The unit of analysis for data_time is the epidemiological week.

The data from the MOH bulletin is already organized this way. The only things left to do are to select the relevant columns and possibly provide “easier” column names.

The meteorological data covers roughly the same time period (2012-2020) as the MOH bulletin. However, its unit of observation is the day, and each date is duplicated across the 19 climate stations. To match the MOH bulletin’s national-level data, the meteorological data must be aggregated to the level of the epidemiological week (and year).
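The day-to-epiweek aggregation described above can be sketched with toy data (made-up numbers, base R only):

```r
# Toy sketch of the day -> epidemiological-week aggregation:
# two stations, two weeks, seven daily rainfall readings each (made-up values)
daily <- data.frame(
  Epiweek  = rep(1:2, each = 7, times = 2),  # week index, repeated per station
  Rainfall = c(rep(1, 7), rep(3, 7), rep(2, 7), rep(4, 7))
)

# Collapse all station-days in a week into a single weekly mean
aggregate(Rainfall ~ Epiweek, data = daily, FUN = mean)
#   Epiweek Rainfall
# 1       1      1.5
# 2       2      3.5
```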

How many values are we aggregating for each variable, for each week?

raw_data$mss_19stations %>% 
  tidyr::pivot_longer(Rainfall:Temp_range) %>% 
  tidyr::drop_na() %>% 
  dplyr::count(Epiyear, Epiweek, name) %>% 
  dplyr::select(name, n) %T>% 
  { print(table(.)) } %>% 
  ggplot(aes(x = n, color = name)) +
  geom_histogram(binwidth = 1) +
  facet_grid(name ~ ., scales = "free") + 
  labs(x = "Number of values", y = "Number of weeks") + 
  theme(legend.position = "none")
            n
name          46  51  56  71  73  81  86  89  90  92  94  95  96  97  98  99
  Mean_temp    1   0   0   1   0   2   1   1   3   2   2   3   2   2   4   5
  Rainfall     0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
  Temp_range   0   1   0   0   1   1   0   0   1   0   0   2   1   0   1   4
            n
name         100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
  Mean_temp    4   4   3   7   7   5  12  10  14  26  16  16  18   5   4   1
  Rainfall     0   0   1   0   0   1   1   0   1   0   1   1   1   1   1   0
  Temp_range   0   2   0   2   1   0   2   1   2   8   4   5   5   4   8   4
            n
name         116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
  Mean_temp    2   8   4   6   1   1   5   6  13   9  66   2  10  11  16  23
  Rainfall     2   4   7   7   6   2   3   4  11   9  70   8  15  11  23  19
  Temp_range   4  14  11  20  16  19  19  15  25   7  51   3   2   4   5   7
            n
name         132 133
  Mean_temp   25  55
  Rainfall    38 195
  Temp_range  10 152

Now we join the datasets to get data_time.

data_time <- raw_data$moh_bulletin %>% 
  dplyr::select(Epiyear:End,
                Dengue,
                HFMD,
                `Acute Upper Respiratory Tract infections`,
                `Acute Diarrhoea`) %>% 
  dplyr::rename(URTI = `Acute Upper Respiratory Tract infections`,
                Diarrhoea = `Acute Diarrhoea`) %>% 
  dplyr::left_join(
    raw_data$mss_19stations %>% 
      dplyr::group_by(Epiyear, Epiweek) %>% 
      dplyr::summarise(Mean_rainfall = mean(Rainfall, na.rm = T),
                       Med_rainfall = median(Rainfall, na.rm = T),
                       Mean_temp = mean(Mean_temp, na.rm = T),
                       Med_temp = median(Mean_temp, na.rm = T),
                       Mean_temp_rng = mean(Temp_range, na.rm = T),
                       Med_temp_rng = median(Temp_range, na.rm = T),
                       .groups = "drop"),
    by = c("Epiyear", "Epiweek")
  ) %>% 
  tidyr::drop_na()

# Data Cleaning
mask <- data_time$End < "2012-01-01" | data_time$End > "2020-07-31"

# Must personally see what's wrong
data_time[mask,]
# A tibble: 1 x 14
  Epiyear Epiweek Start               End                 Dengue  HFMD  URTI
    <dbl>   <dbl> <dttm>              <dttm>               <dbl> <dbl> <dbl>
1    2018       1 2017-12-31 00:00:00 1900-01-01 00:00:00     83   326 3158.
# ... with 7 more variables: Diarrhoea <dbl>, Mean_rainfall <dbl>,
#   Med_rainfall <dbl>, Mean_temp <dbl>, Med_temp <dbl>, Mean_temp_rng <dbl>,
#   Med_temp_rng <dbl>
# Correct it - thankfully the start date seems ok
data_time[mask, "End"] <- data_time$Start[mask] + lubridate::days(6)

# Then check that it's corrected
data_time[mask,]
# A tibble: 1 x 14
  Epiyear Epiweek Start               End                 Dengue  HFMD  URTI
    <dbl>   <dbl> <dttm>              <dttm>               <dbl> <dbl> <dbl>
1    2018       1 2017-12-31 00:00:00 2018-01-06 00:00:00     83   326 3158.
# ... with 7 more variables: Diarrhoea <dbl>, Mean_rainfall <dbl>,
#   Med_rainfall <dbl>, Mean_temp <dbl>, Med_temp <dbl>, Mean_temp_rng <dbl>,
#   Med_temp_rng <dbl>

Let’s see what’s inside data_time.

dplyr::glimpse(data_time)
Rows: 444
Columns: 14
$ Epiyear       <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start         <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End           <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue        <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD          <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI          <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea     <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall  <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp     <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp      <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng  <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...

data_time contains 10 features: 4 weekly disease counts and 6 aggregated meteorological measures. Each value is associated with an epidemiological week. The time period spans from 2012-W01 to 2020-W30. As the meteorological records for July 2020 had not been published, there are no corresponding values for 2020-W28 and 2020-W29.

data_space

A lot more wrangling is involved.

The unit of analysis for data_space is the planning area. This is specified by raw_data$planning_areas, and the other datasets would have to be transformed to match before merging.

Spatial analysis will be rather limited, with more emphasis placed on visualization.
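A core spatial operation used below is counting points per polygon with `sf::st_intersects()`. A toy sketch with hypothetical geometries (assuming the sf package is installed):

```r
library(sf)

# A 2-by-2 square and three points; two of the points fall inside it
poly <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)
))))
pts <- st_sfc(st_point(c(1, 1)), st_point(c(1.5, 0.5)), st_point(c(3, 3)))

# st_intersects() returns, for each polygon, the indices of matching points;
# lengths() turns that sparse list into a per-polygon count
lengths(st_intersects(poly, pts))  # 2
```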

grand_transform_space <- function(raw_data) {
  data = list(
    "dengue_points" = raw_data$dengue_polys,
    "aedes_points" = raw_data$aedes_polys
  ) %>% 
    lapply(function(df) {
      df %>% 
        dplyr::transmute(n = sub(".*: (\\d+).*", "\\1", Description)) %>% 
        sf::st_centroid() %>% 
        .[rep(1:nrow(.), as.numeric(.$n)),]
    })
  
  data[["clinic_points"]] = raw_data$hci_clinics %>% 
    sf::st_as_sf(coords = c("lon", "lat")) %>% 
    sf::`st_crs<-`("WGS84")
  
  data[["weather_points"]] = raw_data$mss_19stations %>% 
    dplyr::filter(Epiyear == 2020) %>% 
    # Filter for the last 3 weeks
    dplyr::filter(Epiweek > max(Epiweek) - 3) %>% 
    # Aggregation schema (up to 7 days x up to 3 weeks -> 1 value)
    dplyr::group_by(Station) %>% 
    dplyr::summarise(mean_rainfall = mean(Rainfall, na.rm = T),
                     med_rainfall = median(Rainfall, na.rm = T),
                     mean_temp = mean(Mean_temp, na.rm = T),
                     med_temp = median(Mean_temp, na.rm = T),
                     mean_temp_rng = mean(Temp_range, na.rm = T),
                     med_temp_rng = median(Temp_range, na.rm = T),
                     .groups = "drop") %>% 
    tidyr::drop_na() %>% 
    dplyr::left_join(
      raw_data$mss_63station_pos %>% 
        dplyr::select(Station:`Long. (E)`),
      by = "Station"
    ) %>% 
    sf::st_as_sf(coords = c("Long. (E)", "Lat.(N)")) %>% 
    sf::`st_crs<-`("WGS84")
  
  # Because we have weather_points, we can define IDW interpolation and 
  #   immediately join with planning_areas.
  
  idw_interpolation <- function(points, polys, ordinal = 2) {
    # Inverse-distance-weighted (IDW) interpolation
    # - Very large ordinal: Proximity (Thiessen) interpolation
    # - Large ordinal: Heavily weighted by inverse-distance
    # - Moderate ordinal: Weighted by inverse-distance
    # - Zero ordinal: Un-weighted average
    # - Negative ordinal: Nonsense! (more weight to further stations)
    
    # Matrix of inverse-distances raised to some power (npolys x npoints)
    weights = polys %>% 
      sf::st_centroid() %>% 
      sf::st_distance(points) %>% 
      units::set_units(km) %>% 
      { 1 / (. ^ ordinal) }  # (55 x 16)
    
    values = points %>% 
      as.data.frame() %>% 
      dplyr::select(-Station, -geometry) %>% 
      as.matrix()  # (16 x 6)
    
    # (55 x 16) * (16 x 6) => (55 x 6)
    (weights %*% values / rowSums(weights)) %>% 
      tibble::as_tibble() %>% 
      dplyr::mutate(plan_area = polys$plan_area)
  }
  
  data[["planning_areas"]] = raw_data$planning_areas %>% 
    dplyr::bind_cols(
      # Extract data from HTML in the Description column (dwelling types)
      .$Description %>% 
        lapply(function(x) {
          xml2::read_html(x) %>% 
            rvest::html_node("table") %>% 
            rvest::html_table() %>% 
            t() %>% 
            `colnames<-`(.[1,]) %>% 
            .[2, 1:10]
        }) %>% 
        dplyr::bind_rows()
    ) %>% 
    dplyr::select(-Name, -Description) %>% 
    dplyr::rename_all(tolower) %>% 
    dplyr::rename(plan_area = pln_area_n,
                  pop = total) %>% 
    dplyr::mutate(plan_area = tools::toTitleCase(tolower(plan_area)),
                  dplyr::across(pop:others, as.numeric)) %>% 
    tibble::as_tibble() %>% 
    sf::st_as_sf() %>% 
    # Add meteorological data
    dplyr::left_join(
      idw_interpolation(data$weather_points, ., ordinal = 10),
      by = "plan_area"
    )
  
  data
}

data <- grand_transform_space(raw_data)
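Stripped of the matrix bookkeeping, the IDW interpolation inside grand_transform_space reduces to a weighted average with weights 1 / distance^ordinal. A minimal base-R sketch with made-up numbers:

```r
# Inverse-distance-weighted average (base-R sketch, toy numbers)
idw <- function(dist_km, values, ordinal = 2) {
  w <- 1 / dist_km^ordinal
  sum(w * values) / sum(w)
}

# Two stations, 1 km and 3 km away, reporting 27 and 29 deg C:
idw(c(1, 3), c(27, 29))  # 27.2 -- the closer station dominates
```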

data_space <- list(
  ncases = data$dengue_points,
  nhabs = data$aedes_points,
  nclinics = data$clinic_points
) %>% 
  {
    # Count the number of points in each polygon
    lapply(names(.), function(name) {
      .[[name]] %>% 
        sf::st_intersects(data$planning_areas) %>%
        tibble::as_tibble() %>%
        dplyr::mutate(plan_area = data$planning_areas$plan_area[col.id]) %>%
        dplyr::count(plan_area, name = name)
    })
  } %>% 
  # Join everything
  Reduce(function(x, y) dplyr::left_join(x, y, by = "plan_area"), .) %>% 
  dplyr::left_join(data$planning_areas, by = "plan_area") %>% 
  sf::st_as_sf() %>% 
  dplyr::select(-one_to_two_rm:-five_rm_exec_flats, -others) %>% 
  dplyr::mutate(
    area_km2 = units::set_units(sf::st_area(.), km^2),
    ncases_pht = ncases / pop * 100000,
    nclinics_pht = nclinics / pop * 100000,
    hdb_pp = hdb / pop,
    condos_other_apts_pp = condos_other_apts / pop,
    landed_properties_pp = landed_properties / pop,
    popden = as.numeric(pop / area_km2),
    habsden = as.numeric(nhabs / area_km2),
    clinicsden = as.numeric(nclinics / area_km2),
    caseden = as.numeric(ncases / area_km2),
    label = paste0(
      "<b>", plan_area,
      "</b><br/>Area: ", round(area_km2, 2),
      "<br/>Cases: ", ncases,
      "<br/>Clinics: ", nclinics,
      "<br/>Population: ", pop,
      "<br/><i>Aedes</i> habitats: ", nhabs
    )
  )
dplyr::glimpse(data_space)
Rows: 29
Columns: 26
$ plan_area            <chr> "Ang Mo Kio", "Bedok", "Bishan", "Bukit Batok"...
$ ncases               <int> 74, 260, 61, 44, 36, 12, 18, 16, 347, 188, 7, ...
$ nhabs                <int> 32, 72, 8, 11, 5, 5, 2, 2, 44, 25, 2, 6, 24, 6...
$ nclinics             <int> 73, 101, 31, 52, 76, 32, 39, 44, 59, 74, 40, 7...
$ pop                  <dbl> 166820, 284930, 90280, 138290, 152790, 76380, ...
$ hdb                  <dbl> 137530, 185520, 62600, 105790, 140400, 7450, 1...
$ condos_other_apts    <dbl> 11850, 52270, 16370, 25760, 10430, 35820, 1631...
$ landed_properties    <dbl> 16220, 44630, 10630, 5730, 490, 32260, 2460, 5...
$ geometry             <MULTIPOLYGON [°]> MULTIPOLYGON (((103.8572 1...., M...
$ mean_rainfall        <dbl> 9.494118, 10.738648, 9.499668, 10.728192, 14.7...
$ med_rainfall         <dbl> 2.0000000, 3.2393580, 2.0071371, 1.6073903, 1....
$ mean_temp            <dbl> 27.98235, 28.30682, 27.98177, 27.69870, 28.215...
$ med_temp             <dbl> 28.20000, 28.32052, 28.19905, 27.80008, 28.280...
$ mean_temp_rng        <dbl> 5.229412, 4.115153, 5.228916, 5.450107, 4.9553...
$ med_temp_rng         <dbl> 5.200000, 3.895949, 5.199374, 5.415854, 4.8043...
$ area_km2             [km^2] 13.9413792 [km^2], 21.7331261 [km^2], 7.61892...
$ ncases_pht           <dbl> 44.359190, 91.250483, 67.567568, 31.817196, 23...
$ nclinics_pht         <dbl> 43.75974, 35.44730, 34.33762, 37.60214, 49.741...
$ hdb_pp               <dbl> 0.82442153, 0.65110729, 0.69339832, 0.76498662...
$ condos_other_apts_pp <dbl> 0.07103465, 0.18344857, 0.18132477, 0.18627522...
$ landed_properties_pp <dbl> 0.097230548, 0.156634963, 0.117744794, 0.04143...
$ popden               <dbl> 11965.8176, 13110.4011, 11849.4478, 12421.3671...
$ habsden              <dbl> 2.2953253, 3.3129150, 1.0500175, 0.9880327, 0....
$ clinicsden           <dbl> 5.236211, 4.647284, 4.068818, 4.670700, 5.2549...
$ caseden              <dbl> 5.3079397, 11.9633042, 8.0063836, 3.9521307, 2...
$ label                <chr> "<b>Ang Mo Kio</b><br/>Area: 13.94<br/>Cases: ...

data_space contains 19 main features for each planning area (URA MP14), including the number of dengue cases, the number of Aedes mosquito habitats, the number of clinics, the population, various dwelling types, area, and meteorological variables. Planning areas with no record of dengue cases were excluded, leaving 29 of the 55 planning areas (largely due to the unavailability of data for the north-west region).

Explore

# data_time %T>% 
#   dplyr::glimpse() %>% 
#   dplyr::select(-matches("Epiweek|End")) %>% 
#   dplyr::mutate(Epiyear = as.factor(Epiyear)) %>% 
#   tidyr::pivot_longer(Dengue:Med_temp_rng) %>% 
#   ggplot(aes(x = Start, y = value, color = Epiyear)) + 
#   geom_line() + 
#   geom_point(alpha = 0.3) + 
#   facet_grid(name ~ ., scales = "free_y") + 
#   labs(title = "Weekly history of variables from 2012 to 2020",
#        x = "",
#        y = "",
#        caption = "Sources: moh.gov.sg, weather.gov.sg") + 
#   theme(legend.position = "none")

Choropleths

Absolute numbers
data_space %>% 
  dplyr::select(ncases:pop, mean_rainfall:med_temp_rng, geometry) %>% 
  # Braces stop magrittr from also passing the data as grid.arrange()'s
  # first argument
  {
    gridExtra::grid.arrange(grobs = lapply(list(
      ggplot(aes(fill = ncases), data = .),
      ggplot(aes(fill = nhabs), data = .),
      ggplot(aes(fill = nclinics), data = .),
      ggplot(aes(fill = pop), data = .),
      ggplot(aes(fill = mean_rainfall), data = .),
      ggplot(aes(fill = med_rainfall), data = .),
      ggplot(aes(fill = mean_temp), data = .),
      ggplot(aes(fill = med_temp), data = .),
      ggplot(aes(fill = mean_temp_rng), data = .),
      ggplot(aes(fill = med_temp_rng), data = .)
    ), function(x) {
      x + 
        geom_sf() + 
        scale_fill_viridis_c(guide = guide_colourbar(
          title.position = "top",
          title.hjust = .5,
          barwidth = 10,
          barheight = .5
        )) + 
        theme_void() + 
        theme(legend.position = "bottom")
    }), ncol = 2)
  }

Areal densities
data_space %>% 
  dplyr::select(ncases:pop, area_km2, geometry) %>% 
  dplyr::mutate_at(dplyr::vars(-geometry), ~(. / area_km2)) %>% 
  dplyr::select(-area_km2) %>% 
  dplyr::mutate_at(dplyr::vars(-geometry), as.numeric) %>% 
  # Braces stop magrittr from also passing the data as grid.arrange()'s
  # first argument
  {
    gridExtra::grid.arrange(grobs = lapply(list(
      ggplot(aes(fill = ncases), data = .),
      ggplot(aes(fill = nhabs), data = .),
      ggplot(aes(fill = nclinics), data = .),
      ggplot(aes(fill = pop), data = .)
    ), function(x) {
      x + 
        geom_sf() + 
        scale_fill_viridis_c(guide = guide_colourbar(
          title.position = "top",
          title.hjust = .5,
          barwidth = 10,
          barheight = .5
        )) + 
        theme_void() + 
        theme(legend.position = "bottom")
    }), ncol = 2)
  }

Interactive plot of the spatial distribution of dengue cases

data_space %>% 
  leaflet::leaflet(width = "100%") %>%
  leaflet::addTiles() %>%
  leaflet::addPolygons(
    weight = 1,
    opacity = 1,
    fillOpacity = 0.6,
    smoothFactor = 0.5,
    fillColor = ~leaflet::colorNumeric("Reds", caseden)(caseden),
    label = ~lapply(label, htmltools::HTML),
    popup = ~lapply(label, htmltools::HTML)
  ) %>%
  leaflet::addCircleMarkers(
    data = data$dengue_points,
    color = "red",
    radius = 5,
    fillOpacity = 0.5,
    clusterOptions = leaflet::markerClusterOptions()
  ) %>% 
  leaflet::addLabelOnlyMarkers(
    data = sf::st_centroid(data_space),
    label =  ~plan_area,
    labelOptions = leaflet::labelOptions(
      noHide = T,
      textOnly = T,
      direction = "center",
      style = list("color" = "blue"))
  )


Model

Dengue

Let’s look at the distributions of the candidate predictors and the outcome variable (Dengue).

{par(mfrow = c(3, 2))
  hist(data_time$Dengue, breaks=40)
  hist(data_time$Mean_rainfall, breaks=40)
  hist(data_time$Mean_temp, breaks=40)
  hist(data_time$Med_temp, breaks=40)
  hist(data_time$Mean_temp_rng, breaks=40)
  hist(data_time$Med_temp_rng, breaks=40)
}

ks.test(data_time$Mean_temp, data_time$Med_temp)

    Two-sample Kolmogorov-Smirnov test

data:  data_time$Mean_temp and data_time$Med_temp
D = 0, p-value = 1
alternative hypothesis: two-sided
ks.test(data_time$Mean_temp_rng, data_time$Med_temp_rng)

    Two-sample Kolmogorov-Smirnov test

data:  data_time$Mean_temp_rng and data_time$Med_temp_rng
D = 0.065315, p-value = 0.2999
alternative hypothesis: two-sided

Mean_temp and Med_temp have similar distributions, as do Mean_temp_rng and Med_temp_rng; therefore the medians (Med_temp and Med_temp_rng) were used.

Let’s run a correlation analysis to see the relationships between the variables.

data_time %>% 
  dplyr::select(Dengue, Mean_rainfall, Med_temp, Med_temp_rng) %>%
  as.matrix() %>%
  Hmisc::rcorr() %>%
  broom::tidy() %>% 
  # dplyr::filter(p.value < 0.05) %>% 
  dplyr::mutate(dplyr::across(where(is.numeric), ~round(., 3))) %>% 
  dplyr::arrange(estimate)
# A tibble: 6 x 5
  column1       column2       estimate     n p.value
  <chr>         <chr>            <dbl> <dbl>   <dbl>
1 Med_temp      Mean_rainfall   -0.499   444   0    
2 Med_temp_rng  Dengue          -0.151   444   0.001
3 Mean_rainfall Dengue          -0.052   444   0.278
4 Med_temp_rng  Med_temp         0.031   444   0.51 
5 Med_temp_rng  Mean_rainfall    0.084   444   0.077
6 Med_temp      Dengue           0.201   444   0    

Run a linear regression model

model1_Dengue <- lm(Dengue ~ Mean_rainfall + Med_temp + Med_temp_rng,
                    data = data_time)

broom::tidy(model1_Dengue) %>% 
  dplyr::mutate(dplyr::across(where(is.numeric), ~round(., 3)))
# A tibble: 4 x 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -1237.      378.       -3.27   0.001
2 Mean_rainfall     4.01      2.46      1.63   0.104
3 Med_temp         61.9      13.2       4.70   0    
4 Med_temp_rng    -43.3      12.1      -3.59   0    

Let’s check the skewness of the y variable.

data_time %>% 
  tidyr::drop_na() %>% 
  { e1071::skewness(.$Dengue) }  # [1] 2.033787
[1] 2.033787

Let’s see how the density plots look after transformation.

{
  par(mfrow = c(2, 2))
  plot(density(data_time$Dengue, na.rm = T), main = "untransformed")
  plot(density(sqrt(data_time$Dengue), na.rm = T), main = "sqrt")
  plot(density(log10(data_time$Dengue), na.rm = T), main = "log10")
  plot(density(1 / data_time$Dengue, na.rm = T), main = "inverse")
}

None of the transformations look particularly good. Let’s try them on Model 1 anyway.

model1_log_Dengue <- lm(log10(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
                        data = data_time)
broom::tidy(model1_log_Dengue)
# A tibble: 4 x 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -0.309     0.700      -0.441 0.659    
2 Mean_rainfall  0.00294   0.00456     0.644 0.520    
3 Med_temp       0.0993    0.0244      4.06  0.0000572
4 Med_temp_rng  -0.0407    0.0224     -1.82  0.0695   
gvlma::gvlma(model1_log_Dengue)

Call:
lm(formula = log10(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
    -0.309087       0.002939       0.099330      -0.040711  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model1_log_Dengue) 

                     Value   p-value                   Decision
Global Stat        28.1144 1.182e-05 Assumptions NOT satisfied!
Skewness            1.9927 1.581e-01    Assumptions acceptable.
Kurtosis           17.8056 2.447e-05 Assumptions NOT satisfied!
Link Function       0.1611 6.882e-01    Assumptions acceptable.
Heteroscedasticity  8.1550 4.294e-03 Assumptions NOT satisfied!
model1_sqrt_Dengue <- lm(sqrt(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
                         data = data_time)
broom::tidy(model1_sqrt_Dengue)
# A tibble: 4 x 5
  term          estimate std.error statistic    p.value
  <chr>            <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   -28.7      11.2        -2.57 0.0105    
2 Mean_rainfall   0.0782    0.0727      1.08 0.283     
3 Med_temp        1.74      0.390       4.47 0.00000986
4 Med_temp_rng   -0.974     0.357      -2.73 0.00655   
gvlma::gvlma(model1_sqrt_Dengue)

Call:
lm(formula = sqrt(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
    -28.71254        0.07821        1.74274       -0.97433  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model1_sqrt_Dengue) 

                     Value   p-value                   Decision
Global Stat        32.7985 1.313e-06 Assumptions NOT satisfied!
Skewness           22.7656 1.830e-06 Assumptions NOT satisfied!
Kurtosis            0.2636 6.077e-01    Assumptions acceptable.
Link Function       0.5967 4.398e-01    Assumptions acceptable.
Heteroscedasticity  9.1726 2.457e-03 Assumptions NOT satisfied!
model1_inv_Dengue <- lm((1 / Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng,
                        data = data_time)
broom::tidy(model1_inv_Dengue)
# A tibble: 4 x 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.0529    0.0141        3.76  0.000193
2 Mean_rainfall -0.0000488 0.0000916    -0.533 0.594   
3 Med_temp      -0.00164   0.000491     -3.34  0.000896
4 Med_temp_rng   0.000310  0.000449      0.689 0.491   
gvlma::gvlma(model1_inv_Dengue)

Call:
lm(formula = (1/Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
    0.0528561     -0.0000488     -0.0016408      0.0003096  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model1_inv_Dengue) 

                     Value   p-value                   Decision
Global Stat        200.447 0.000e+00 Assumptions NOT satisfied!
Skewness           133.155 0.000e+00 Assumptions NOT satisfied!
Kurtosis            41.572 1.136e-10 Assumptions NOT satisfied!
Link Function        1.142 2.853e-01    Assumptions acceptable.
Heteroscedasticity  24.578 7.135e-07 Assumptions NOT satisfied!

Let’s have a look at the models’ adjusted R2 values

list(
  "untransformed" = model1_Dengue,
  "log" = model1_log_Dengue,
  "sqrt" = model1_sqrt_Dengue,
  "inv" = model1_inv_Dengue
) %>% 
  lapply(broom::glance) %>% 
  { dplyr::bind_cols(names(.), dplyr::bind_rows(.)) } %>% 
  dplyr::rename(name = ...1)
New names:
* NA -> ...1
# A tibble: 4 x 13
  name  r.squared adj.r.squared   sigma statistic p.value    df logLik    AIC
  <chr>     <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>
1 untr~    0.0709        0.0646 2.02e+2     11.2  4.34e-7     3 -2985.  5981.
2 log      0.0472        0.0407 3.75e-1      7.26 9.19e-5     3  -193.   395.
3 sqrt     0.0599        0.0535 5.98e+0      9.34 5.34e-6     3 -1422.  2854.
4 inv      0.0292        0.0226 7.53e-3      4.41 4.53e-3     3  1543. -3075.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

It seems that model1_Dengue (untransformed) has a higher adjusted R2 value than the other models, explaining about 6% of the variance (though R2 values are not strictly comparable across different transformations of the outcome).

Plot the regression models

data_time %>% 
  dplyr::select(Dengue, Med_temp, Med_temp_rng, Mean_rainfall) %>% 
  # scale() %>% 
  tibble::as_tibble() %>% 
  tidyr::pivot_longer(-Dengue) %>% 
  ggplot(aes(x = value, y = Dengue, color = name)) +
  geom_point(alpha = 0.4) + 
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") + 
  facet_grid(. ~ name, scales = "free") + 
  theme(legend.position = "none")

For reference, the monthly mean of the 24-hr temperature is 27.5, and the mean rainfall is 13.8.

Since rainfall has marginal significance at an alpha of 10% and is known to affect vector-borne diseases, we decided to explore the rainfall variable by categorizing it (<11 mm vs >=11 mm of rainfall).

data_time$Mean_rainfall %>% 
  hist()

data_time <- data_time %>% 
  dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))

# glimpse(data_time)

Re-run Model 1 with the categorised rain variable

model2_Dengue <- lm(Dengue ~ Rain_11 + Med_temp + Med_temp_rng,
                    data = data_time)

gvlma::gvlma(model2_Dengue)

Call:
lm(formula = Dengue ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time)

Coefficients:
 (Intercept)       Rain_11      Med_temp  Med_temp_rng  
    -1219.75         62.59         61.56        -42.21  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model2_Dengue) 

                     Value   p-value                   Decision
Global Stat        803.970 0.000e+00 Assumptions NOT satisfied!
Skewness           218.166 0.000e+00 Assumptions NOT satisfied!
Kurtosis           556.830 0.000e+00 Assumptions NOT satisfied!
Link Function        2.482 1.151e-01    Assumptions acceptable.
Heteroscedasticity  26.493 2.645e-07 Assumptions NOT satisfied!
broom::tidy(model2_Dengue)
# A tibble: 4 x 5
  term         estimate std.error statistic     p.value
  <chr>           <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   -1220.      351.      -3.48 0.000556   
2 Rain_11          62.6      28.1      2.22 0.0266     
3 Med_temp         61.6      12.3      5.01 0.000000805
4 Med_temp_rng    -42.2      12.0     -3.53 0.000463   

Add Interaction term

model2_Dengue_Int <- lm(Dengue ~ (Med_temp + Med_temp_rng) * Rain_11,
                        data = data_time)

gvlma::gvlma(model2_Dengue_Int)

Call:
lm(formula = Dengue ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time)

Coefficients:
         (Intercept)              Med_temp          Med_temp_rng  
            -1187.07                 56.90                -26.90  
             Rain_11      Med_temp:Rain_11  Med_temp_rng:Rain_11  
            -1861.85                 97.09               -111.60  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model2_Dengue_Int) 

                    Value   p-value                   Decision
Global Stat        513.22 0.000e+00 Assumptions NOT satisfied!
Skewness           166.77 0.000e+00 Assumptions NOT satisfied!
Kurtosis           310.38 0.000e+00 Assumptions NOT satisfied!
Link Function       16.41 5.092e-05 Assumptions NOT satisfied!
Heteroscedasticity  19.65 9.293e-06 Assumptions NOT satisfied!
broom::tidy(model2_Dengue_Int)
# A tibble: 6 x 5
  term                 estimate std.error statistic   p.value
  <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)           -1187.      382.      -3.10 0.00204  
2 Med_temp                 56.9      13.1      4.34 0.0000179
3 Med_temp_rng            -26.9      13.3     -2.02 0.0435   
4 Rain_11               -1862.     1006.      -1.85 0.0649   
5 Med_temp:Rain_11         97.1      39.4      2.46 0.0142   
6 Med_temp_rng:Rain_11   -112.       32.7     -3.41 0.000702 
anova(model2_Dengue, model2_Dengue_Int)
Analysis of Variance Table

Model 1: Dengue ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: Dengue ~ (Med_temp + Med_temp_rng) * Rain_11
  Res.Df      RSS Df Sum of Sq      F   Pr(>F)   
1    440 17909105                                
2    438 17396320  2    512786 6.4554 0.001726 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(model2_Dengue)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0757        0.0694  202.      12.0 1.44e-7     3 -2984. 5979. 5999.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model2_Dengue_Int)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.102        0.0919  199.      9.97 4.82e-9     5 -2978. 5970. 5998.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

There is an interaction effect (Rain_11 with the two other predictors), and the ANOVA test shows that adding the interaction terms improves the model fit (p = 0.0017). The interaction model also has a higher R2 value.

r.squared:     model2_Dengue = 0.0757, model2_Dengue_Int = 0.102
adj.r.squared: model2_Dengue = 0.0694, model2_Dengue_Int = 0.0919

Unfortunately, most of the model assumptions are not satisfied, so this model won’t be a good one for prediction.

Let’s explore this interaction model.

# Report the model summary and significance using export_summs()
library(jtools)
jtools::export_summs(model2_Dengue, model2_Dengue_Int,
                     error_format = "(p = {p.value})",
                     model.names = c("Main Effects Only",
                                     "Interaction Effects"),
                     digits = 3)
                       Main Effects Only    Interaction Effects
(Intercept)            -1219.750 ***        -1187.070 **
                       (p = 0.001)          (p = 0.002)
Rain_11                62.587 *             -1861.847
                       (p = 0.027)          (p = 0.065)
Med_temp               61.565 ***           56.900 ***
                       (p = 0.000)          (p = 0.000)
Med_temp_rng           -42.213 ***          -26.895 *
                       (p = 0.000)          (p = 0.044)
Med_temp:Rain_11                            97.086 *
                                            (p = 0.014)
Med_temp_rng:Rain_11                        -111.602 ***
                                            (p = 0.001)
N                      444                  444
R2                     0.076                0.102
*** p < 0.001; ** p < 0.01; * p < 0.05.
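Because of the interaction terms, the effect of Rain_11 is no longer a single coefficient; it varies with the moderators. As a rough sanity check, its simple slope can be read off the coefficients by hand (values copied, rounded, from the model output above, so the result is approximate):

```r
# Simple slope of Rain_11 implied by the interaction model:
#   slope = b_Rain11 + b_MedTemp:Rain11 * Med_temp + b_MedTempRng:Rain11 * Med_temp_rng
rain11_slope <- function(med_temp, med_temp_rng) {
  -1861.85 + 97.09 * med_temp - 111.60 * med_temp_rng
}

# At the moderator means (Med_temp = 27.96, Med_temp_rng = 6.43) this gives
# roughly 135, close to the mean-of-moderators estimate from sim_slopes()
rain11_slope(27.96, 6.43)
```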

Probing interaction effects: run a spotlight analysis using the Johnson-Neyman technique.

#  a. Moderator: Median Temperature Range
interactions::sim_slopes(model2_Dengue_Int, 
           pred = Rain_11,
           modx = Med_temp_rng, 
           johnson_neyman = T) #7.01
JOHNSON-NEYMAN INTERVAL 

When Med_temp_rng is OUTSIDE the interval [7.01, 9.04], the slope of
Rain_11 is p < .05.

Note: The range of observed values of Med_temp_rng is [3.50, 8.50]

SIMPLE SLOPES ANALYSIS 

Slope of Rain_11 when Med_temp_rng = 5.63 (- 1 SD): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  223.99   53.48     4.19   0.00

Slope of Rain_11 when Med_temp_rng = 6.43 (Mean): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  134.46   38.04     3.53   0.00

Slope of Rain_11 when Med_temp_rng = 7.24 (+ 1 SD): 

   Est.    S.E.   t val.      p
------- ------- -------- ------
  44.92   37.56     1.20   0.23
interactions::interact_plot(model2_Dengue_Int,
              pred="Med_temp",
              modx = "Rain_11",
              modx.labels = c("less than 11mm of rainfall", 
                              "rainfall >= 11mm"),
              interval = T,
              int.width = .95,
              colors = c("deeppink",
                         "darkgreen"),
              vary.lty = T,
              line.thickness = 1,
              legend.main = "") +
  ggplot2::geom_vline(xintercept = 27.15, col = "red", linetype = 3, size = 1) +
  labs(title = "Effect of Rainfall and Temperature on Dengue cases in Singapore",
       subtitle = "With an average weekly rainfall of 11mm and above, the number of Dengue cases\nincreases with rise in temperature", 
       x = "Weekly Median Temperature",
       y = "Number of Dengue Cases",
       caption = "Source: MOH, NEA Websites") +
  annotate("text", 
           x = 28, 
           y = 0,
           label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") + 
  theme(legend.position = "top")

#  b. Moderator: Median Temperature
interactions::sim_slopes(model2_Dengue_Int, 
           pred = Rain_11,
           modx = Med_temp, 
           johnson_neyman = T) #27.15
JOHNSON-NEYMAN INTERVAL 

When Med_temp is OUTSIDE the interval [23.43, 27.15], the slope of Rain_11
is p < .05.

Note: The range of observed values of Med_temp is [25.06, 29.96]

SIMPLE SLOPES ANALYSIS 

Slope of Rain_11 when Med_temp = 27.11 (- 1 SD): 

   Est.    S.E.   t val.      p
------- ------- -------- ------
  52.49   28.96     1.81   0.07

Slope of Rain_11 when Med_temp = 27.96 (Mean): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  134.46   38.04     3.53   0.00

Slope of Rain_11 when Med_temp = 28.80 (+ 1 SD): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  216.42   65.35     3.31   0.00
interactions::interact_plot(model2_Dengue_Int,
              pred="Med_temp_rng",
              modx = "Rain_11",
              modx.labels = c("less than 11mm of rainfall", 
                              "rainfall >= 11mm"), 
              interval = T,
              int.width = .95,
              colors = c("deeppink",
                         "darkgreen"),
              vary.lty = T,
              line.thickness = 1,
              legend.main = "Rain Fall")+
  ggplot2::geom_vline(xintercept = 7.01, col = "red", linetype = 3, size = 1) +
  labs(title = "Effect of Rainfall and Temperature Variation on Dengue Cases in Singapore",
       subtitle = "With an average weekly rainfall of 11mm and above, the number of Dengue cases\ndecreases when temperature variation is HIGH", 
       x = "Weekly Median Temperature Range",
       y = "Number of Dengue Cases",
       caption = "Source: MOH, NEA Websites") +
  annotate("text", 
           x = 6, 
           y = 100,
           label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") + 
  theme(legend.position = "top")

Upper Respiratory Tract Infections (URTI)

Since the number of URTI cases in Singapore drastically reduced due to COVID-19, we decided to exclude the values from April 2020 onwards (the circuit breaker period) from this analysis.

Let’s plot the x variables (Mean_rainfall, Med_temp, Med_temp_rng) and the y variable (URTI).

data_time_URTI <- data_time %>% 
  dplyr::filter(URTI >= 2000)

{
  par(mfrow = c(2, 2))
  hist(data_time_URTI$Mean_rainfall, breaks=40)
  hist(data_time_URTI$Med_temp, breaks=40)
  hist(data_time_URTI$Med_temp_rng, breaks=40)
  hist(data_time_URTI$URTI, breaks=40)
}

Let’s run a correlation model to see the relationship between variables

data_time_URTI %>% 
  dplyr::select(URTI, Mean_rainfall, Med_temp, Med_temp_rng) %>%
  as.matrix() %>%
  Hmisc::rcorr() %>%
  broom::tidy() %>% 
  # dplyr::filter(p.value < 0.05) %>% 
  dplyr::mutate(dplyr::across(where(is.numeric), ~round(., 3))) %>% 
  dplyr::arrange(estimate)
# A tibble: 6 x 5
  column1       column2       estimate     n p.value
  <chr>         <chr>            <dbl> <dbl>   <dbl>
1 Med_temp      Mean_rainfall   -0.508   429   0    
2 Med_temp      URTI            -0.195   429   0    
3 Med_temp_rng  URTI            -0.086   429   0.075
4 Mean_rainfall URTI            -0.042   429   0.383
5 Med_temp_rng  Med_temp         0.027   429   0.577
6 Med_temp_rng  Mean_rainfall    0.091   429   0.06 

Run a simple linear regression model

model1_URTI <- lm(URTI~Mean_rainfall+Med_temp+Med_temp_rng,
                    data=data_time_URTI)

broom::tidy(model1_URTI)
# A tibble: 4 x 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     6804.     729.        9.34 5.59e-19
2 Mean_rainfall    -15.7      4.74     -3.31 1.01e- 3
3 Med_temp        -133.      25.5      -5.22 2.75e- 7
4 Med_temp_rng     -30.5     23.2      -1.31 1.91e- 1
gvlma::gvlma(model1_URTI)

Call:
lm(formula = URTI ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time_URTI)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
      6804.45         -15.68        -132.97         -30.45  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model1_URTI) 

                     Value   p-value                   Decision
Global Stat        119.874 0.000e+00 Assumptions NOT satisfied!
Skewness            48.234 3.782e-12 Assumptions NOT satisfied!
Kurtosis            48.947 2.630e-12 Assumptions NOT satisfied!
Link Function        3.667 5.550e-02    Assumptions acceptable.
Heteroscedasticity  19.026 1.290e-05 Assumptions NOT satisfied!

Assumption checks failed. Let’s check the skewness of the y variable.

e1071::skewness(data_time_URTI$URTI) #[1] 0.8634216
[1] 0.8634216
# Let's see what the density plots look like after transformation
{
  par(mfrow = c(2, 2))
  plot(density(data_time_URTI$URTI), main = "untransformed")
  plot(density(sqrt(data_time_URTI$URTI)), main = "sqrt")
  plot(density(log10(data_time_URTI$URTI)), main = "log10")
  plot(density(1 / data_time_URTI$URTI), main = "inverse")
}

Try them on Model 1

model1_log_URTI <- lm(log10(URTI)~Mean_rainfall+Med_temp+Med_temp_rng,
                        data=data_time_URTI)
broom::tidy(model1_log_URTI)
# A tibble: 4 x 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    4.03     0.110        36.8  3.38e-134
2 Mean_rainfall -0.00232  0.000713     -3.26 1.22e-  3
3 Med_temp      -0.0195   0.00383      -5.09 5.50e-  7
4 Med_temp_rng  -0.00523  0.00350      -1.50 1.35e-  1
gvlma::gvlma(model1_log_URTI)

Call:
lm(formula = log10(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time_URTI)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
     4.034784      -0.002320      -0.019479      -0.005233  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model1_log_URTI) 

                    Value   p-value                   Decision
Global Stat        22.357 0.0001701 Assumptions NOT satisfied!
Skewness            8.284 0.0039991 Assumptions NOT satisfied!
Kurtosis            1.968 0.1606558    Assumptions acceptable.
Link Function       3.159 0.0755032    Assumptions acceptable.
Heteroscedasticity  8.946 0.0027807 Assumptions NOT satisfied!
model1_sqrt_URTI <- lm(sqrt(URTI)~Mean_rainfall+Med_temp+Med_temp_rng,
                         data=data_time_URTI)
broom::tidy(model1_sqrt_URTI)
# A tibble: 4 x 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     89.6      6.76       13.3  7.25e-34
2 Mean_rainfall   -0.144    0.0439     -3.29 1.09e- 3
3 Med_temp        -1.22     0.236      -5.16 3.73e- 7
4 Med_temp_rng    -0.303    0.215      -1.41 1.60e- 1
gvlma::gvlma(model1_sqrt_URTI)

Call:
lm(formula = sqrt(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time_URTI)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
      89.6374        -0.1444        -1.2182        -0.3032  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model1_sqrt_URTI) 

                    Value   p-value                   Decision
Global Stat        52.462 1.104e-10 Assumptions NOT satisfied!
Skewness           23.091 1.545e-06 Assumptions NOT satisfied!
Kurtosis           12.591 3.877e-04 Assumptions NOT satisfied!
Link Function       3.415 6.462e-02    Assumptions acceptable.
Heteroscedasticity 13.365 2.563e-04 Assumptions NOT satisfied!
model1_inv_URTI <- lm((1/URTI)~Mean_rainfall+Med_temp+Med_temp_rng,
                        data=data_time_URTI)
broom::tidy(model1_inv_URTI)
# A tibble: 4 x 5
  term             estimate   std.error statistic    p.value
  <chr>               <dbl>       <dbl>     <dbl>      <dbl>
1 (Intercept)   -0.000108   0.0000902       -1.20 0.231     
2 Mean_rainfall  0.00000185 0.000000586      3.16 0.00168   
3 Med_temp       0.0000154  0.00000315       4.89 0.00000143
4 Med_temp_rng   0.00000475 0.00000288       1.65 0.0997    
gvlma::gvlma(model1_inv_URTI)

Call:
lm(formula = (1/URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time_URTI)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
   -1.083e-04      1.854e-06      1.541e-05      4.745e-06  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model1_inv_URTI) 

                     Value p-value                Decision
Global Stat        6.13667 0.18917 Assumptions acceptable.
Skewness           0.31090 0.57713 Assumptions acceptable.
Kurtosis           0.03011 0.86224 Assumptions acceptable.
Link Function      2.65447 0.10326 Assumptions acceptable.
Heteroscedasticity 3.14119 0.07634 Assumptions acceptable.

Assumptions acceptable
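Since the accepted model regresses 1/URTI, its coefficients live on the inverse scale: a positive coefficient means the case count falls as the predictor rises, and fitted values must be inverted before being read as counts. A self-contained sketch on simulated data (temp and cases are illustrative variables, not the project data):

```r
# Inverse-transformed response: fit on 1/cases, then invert fitted values.
set.seed(7)
temp  <- runif(120, 25, 30)                               # weekly median temperature
cases <- round(5500 - 120 * temp + rnorm(120, sd = 150))  # counts that fall with temp

m_inv <- lm(I(1 / cases) ~ temp)

# cases fall as temp rises, so on the 1/cases scale the slope is positive
coef(m_inv)["temp"]

# back-transform fitted values to interpret them as case counts
pred_counts <- 1 / fitted(m_inv)
range(round(pred_counts))
```

This sign flip is worth keeping in mind when reading the model2_inv_URTI coefficients below: the positive Med_temp estimate on the 1/URTI scale corresponds to fewer URTI cases at higher temperatures.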

Plot the regression model 1

data_time_URTI %>% 
  dplyr::select(URTI, Med_temp, Med_temp_rng,Mean_rainfall) %>% 
  # scale() %>% 
  tibble::as_tibble() %>% 
  tidyr::pivot_longer(-URTI) %>% 
  ggplot(aes(x = value, y = URTI, color = name)) +
  geom_point(alpha = 0.4) + 
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") + 
  facet_grid(. ~ name, scales = "free") + 
  theme(legend.position = "none")

From the plot, there are more cases when rainfall is lower, so we decided to categorize the rainfall variable (<11mm vs >=11mm of weekly mean rainfall).

data_time_URTI <- data_time_URTI %>% 
  dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))

# glimpse(data_time_URTI)

Re-run Model 1 with the categorized rain variable.

model2_inv_URTI<- lm((1/URTI)~Rain_11+Med_temp+Med_temp_rng,
                   data=data_time_URTI)

gvlma::gvlma(model2_inv_URTI)

Call:
lm(formula = (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time_URTI)

Coefficients:
 (Intercept)       Rain_11      Med_temp  Med_temp_rng  
  -3.362e-05     1.536e-05     1.290e-05     5.447e-06  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model2_inv_URTI) 

                     Value p-value                Decision
Global Stat        7.11459  0.1300 Assumptions acceptable.
Skewness           0.06411  0.8001 Assumptions acceptable.
Kurtosis           0.01816  0.8928 Assumptions acceptable.
Link Function      3.41266  0.0647 Assumptions acceptable.
Heteroscedasticity 3.61966  0.0571 Assumptions acceptable.
broom::tidy(model2_inv_URTI)
# A tibble: 4 x 5
  term            estimate  std.error statistic   p.value
  <chr>              <dbl>      <dbl>     <dbl>     <dbl>
1 (Intercept)  -0.0000336  0.0000842     -0.399 0.690    
2 Rain_11       0.0000154  0.00000683     2.25  0.0251   
3 Med_temp      0.0000129  0.00000296     4.36  0.0000162
4 Med_temp_rng  0.00000545 0.00000288     1.89  0.0589   

Add Interaction term

model2_inv_URTI_Int <- lm((1/URTI)~(Med_temp+Med_temp_rng)*Rain_11,
                           data=data_time_URTI)

gvlma::gvlma(model2_inv_URTI_Int)

Call:
lm(formula = (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11, 
    data = data_time_URTI)

Coefficients:
         (Intercept)              Med_temp          Med_temp_rng  
          -1.161e-04             1.590e-05             5.151e-06  
             Rain_11      Med_temp:Rain_11  Med_temp_rng:Rain_11  
           8.132e-04            -3.302e-05             1.568e-05  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = model2_inv_URTI_Int) 

                    Value p-value                Decision
Global Stat        4.8987 0.29786 Assumptions acceptable.
Skewness           0.1427 0.70559 Assumptions acceptable.
Kurtosis           0.3799 0.53765 Assumptions acceptable.
Link Function      0.7336 0.39173 Assumptions acceptable.
Heteroscedasticity 3.6424 0.05632 Assumptions acceptable.
broom::tidy(model2_inv_URTI_Int)
# A tibble: 6 x 5
  term                    estimate  std.error statistic     p.value
  <chr>                      <dbl>      <dbl>     <dbl>       <dbl>
1 (Intercept)          -0.000116   0.0000917      -1.27 0.206      
2 Med_temp              0.0000159  0.00000315      5.05 0.000000645
3 Med_temp_rng          0.00000515 0.00000318      1.62 0.106      
4 Rain_11               0.000813   0.000254        3.21 0.00144    
5 Med_temp:Rain_11     -0.0000330  0.0000101      -3.27 0.00116    
6 Med_temp_rng:Rain_11  0.0000157  0.00000837      1.87 0.0617     
anova(model2_inv_URTI,model2_inv_URTI_Int)
Analysis of Variance Table

Model 1: (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11
  Res.Df        RSS Df  Sum of Sq      F   Pr(>F)   
1    425 9.6001e-07                                 
2    423 9.3611e-07  2 2.3896e-08 5.3988 0.004839 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(model2_inv_URTI)
# A tibble: 1 x 12
  r.squared adj.r.squared   sigma statistic p.value    df logLik    AIC    BIC
      <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1    0.0534        0.0467 4.75e-5      7.99 3.43e-5     3  3664. -7317. -7297.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(model2_inv_URTI_Int)
# A tibble: 1 x 12
  r.squared adj.r.squared   sigma statistic p.value    df logLik    AIC    BIC
      <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1    0.0769        0.0660 4.70e-5      7.05 2.40e-6     5  3669. -7324. -7296.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Median temperature is significant at an alpha level of 5%. There is an interaction effect too (Rain_11 on Med_temp), and an ANOVA test shows that adding the interaction terms improves the model fit (p-value = 0.0048). This interaction model also has a higher R2 value.

r.squared(model2_inv_URTI)         --> 0.053
r.squared(model2_inv_URTI_Int)     --> 0.077
adj.r.squared(model2_inv_URTI)     --> 0.0467
adj.r.squared(model2_inv_URTI_Int) --> 0.0660

Report the model summary and significance using the export_summs() function.

jtools::export_summs(model2_inv_URTI, model2_inv_URTI_Int,
             error_format = "(p = {p.value})",
             model.names = c("Main Effects Only",
                             "Interaction Effects"),
             digits = 3)
                       Main Effects Only   Interaction Effects
(Intercept)            -0.000              -0.000
                       (p = 0.690)         (p = 0.206)
Rain_11                 0.000 *             0.001 **
                       (p = 0.025)         (p = 0.001)
Med_temp                0.000 ***           0.000 ***
                       (p = 0.000)         (p = 0.000)
Med_temp_rng            0.000               0.000
                       (p = 0.059)         (p = 0.106)
Med_temp:Rain_11                           -0.000 **
                                           (p = 0.001)
Med_temp_rng:Rain_11                        0.000
                                           (p = 0.062)
N                       429                 429
R2                      0.053               0.077
*** p < 0.001; ** p < 0.01; * p < 0.05.

Probing Interaction Effects

#Run Spotlight analysis, using Johnson-Neyman Techniques
# Median Temperature

# interactions::sim_slopes(model2_inv_URTI_Int,
#            pred = Rain_11,
#            modx = Med_temp,
#            johnson_neyman = T) ##[27.28, 28.61]
# 
# #Run interaction_plot() by adding benchmark for regions of significance
# # Use plot B to see the actual URTI numbers instead of y inverse
# #  a. Median Temperature
# interactions::interact_plot(model2_inv_URTI_Int,
#             pred="Med_temp",
#             modx = "Rain_11",
# #             modx.labels = c("less than 11mm of rainfall",
# #                             "rainfall >= 11mm"),
#             interval = T,
#             int.width = .95,
#             colors = c("deeppink",
#                        "darkgreen"),
#             vary.lty = T,
#             line.thickness = 1,
#             legend.main = "") +
# ggplot2:: geom_vline(xintercept = 27.28, col = "red", linetype = 3, size = 1)+
# labs(title = "Effect of Rain fall and Temperature on rise in URTI Cases in Singapore",
#      subtitle = "For lower weekly rainfall, the number of URTI cases\ndecreases with rise in temperature",
#      x = "Weekly Median Temperature",
#      y = "1/Number of URTI Cases",
#      caption = "Source: MOH, NEA websites") +
# annotate("text",
#          x = 28,
#          y = 0.0003,
#          label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") +
# theme(legend.position = "top")
##############################################
# This model (model2_URTI_Int, fitted on the untransformed URTI counts) is for plotting purposes only.
model2_URTI_Int <- lm(URTI~(Med_temp+Med_temp_rng)*Rain_11,
                           data=data_time_URTI)

#Run Spotlight analysis, using Johnson-Neyman Techniques
#  Median Temperature
# sim_slopes(model2_URTI_Int, 
#            pred = Rain_11,
#            modx = Med_temp, 
#            johnson_neyman = F) 

interactions::sim_slopes(model2_URTI_Int,
           pred = Rain_11,
           modx = Med_temp,
           johnson_neyman = T) #[27.24, 28.48]
JOHNSON-NEYMAN INTERVAL 

When Med_temp is OUTSIDE the interval [27.24, 28.48], the slope of Rain_11
is p < .05.

Note: The range of observed values of Med_temp is [25.06, 29.96]

SIMPLE SLOPES ANALYSIS 

Slope of Rain_11 when Med_temp = 27.09 (- 1 SD): 

     Est.    S.E.   t val.      p
--------- ------- -------- ------
  -149.99   56.54    -2.65   0.01

Slope of Rain_11 when Med_temp = 27.94 (Mean): 

   Est.    S.E.   t val.      p
------- ------- -------- ------
  79.97   79.73     1.00   0.32

Slope of Rain_11 when Med_temp = 28.78 (+ 1 SD): 

    Est.     S.E.   t val.      p
-------- -------- -------- ------
  309.93   137.96     2.25   0.03
interactions::interact_plot(model2_URTI_Int,
              pred="Med_temp",
              modx = "Rain_11",
              modx.labels = c("less than 11mm of rainfall",
                               "rainfall >= 11mm"),
              interval = T,
              int.width = .95,
              colors = c("deeppink",
                         "darkgreen"),
              vary.lty = T,
              line.thickness = 1,
              legend.main = "") +
  ggplot2::geom_vline(xintercept = 27.24, col = "red", linetype = 3, size = 1)+
  ggplot2::geom_vline(xintercept = 28.48, col = "red", linetype = 3, size = 1)+
  labs(title = "Effect of Rain fall and Temperature on rise in URTI Cases in Singapore",
       subtitle = "For lower weekly rainfall, the number of URTI cases\ndecreases with rise in temperature",
       x = "Weekly Median Temperature",
       y = "Number of URTI Cases",
       caption = "Source: MOH, NEA websites") +
  annotate("text", 
           x = 28, 
           y = 2000,
           label = "The shaded areas denote 95% confidence intervals.\nThe vertical lines mark the boundaries\nbetween regions of significance and non-significance\nbased on alpha at 5%") + 
  theme(legend.position = "top")


The regression analysis came up with two significant interaction terms.

  • First, it appears that the relationship between median temperature and the number of dengue cases differs depending on the amount of weekly rainfall.

  • Second, it appears that the relationship between the weekly median temperature range and the number of dengue cases also differs depending on the amount of weekly rainfall.

To see the patterns of interaction, we visualized the significant interaction effects in the charts above.

Interpretation of the Results

From our data science project, we arrived at the following two findings:

  1. The relationship between temperature and the number of dengue cases differs depending on weekly rainfall. Specifically, in weeks with an average rainfall of 11mm and above, dengue cases rise as the median temperature increases and fall as the weekly temperature range increases; in drier weeks these temperature effects are much weaker. (You might want to highlight that these relationships are based on an observational study, leading to correlations, not causations, in the Limitations and Future Directions section below.)

  2. The relationship between rainfall and the number of URTI cases depends on the median temperature. At low median temperatures (below about 27.2°C), wetter weeks (>=11mm of rainfall) saw fewer URTI cases than drier weeks, whereas at high median temperatures (above about 28.5°C), wetter weeks saw more URTI cases. (Again, acknowledge that the findings are correlational, thus further investigation should follow.)

Implications

Prof. Roh’s Note: “This is where you provide the significance of the findings. Unlike the other sections, where your goal is to describe the results that you found (what the data told you), this is where you chime in and proactively discuss the meaning of the results.”

Limitations and Future Directions

Prof. Roh’s Note: “Acknowledging limitations is not where you just provide a laundry list of what is missing and what should have been done. Please take responsibility for your analyses and help your readers understand what the results tell and don’t (or can’t) tell. More importantly, this is the section where you technically begin your next data science project, by highlighting what would be informative down the line to shed further light on what you have found here.”

References


Appendix


Contribution Statement

Prof. Roh’s Note: “Please describe your individual contribution to the team’s project (in detail).”